Getting Started#

Welcome to DQMaRC (Data Quality Markup and Ready-to-Connect), a Python-based tool for data quality profiling. We are pleased that you are interested in learning about DQMaRC and hope that it proves useful for your data quality profiling needs. This guide provides the foundational knowledge and practical steps needed to get started with DQMaRC, whether you’re a seasoned Python programmer or a non-technical user.

What is Data Quality?#

Data quality (DQ) refers to the condition of a dataset, assessing whether it meets the needs of its intended use. High-quality data is critical for effective and accurate decision-making in any field, particularly in healthcare, finance, and research. DQ errors are indicative of real-world problems arising from behaviours, processes, or systems.

DQMaRC evaluates data quality across six core dimensions as defined by the Data Management Association (DAMA) [1]:

Completeness
Validity
Consistency
Timeliness
Uniqueness
Accuracy
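
To make the dimensions more concrete, here is a minimal sketch in plain pandas of how three of them (completeness, uniqueness, and consistency) can be expressed as 1/0 flags. It does not use the DQMaRC API; the column names and values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical example records; column names are illustrative only.
df = pd.DataFrame(
    {
        "patient_id": [101, 102, 102, 104],
        "date_of_birth": pd.to_datetime(
            ["1980-05-01", "1975-11-23", "1975-11-23", "1990-02-14"]
        ),
        "date_of_diagnosis": pd.to_datetime(
            ["2020-03-10", None, "1970-01-01", "2021-07-30"]
        ),
    }
)

# Completeness: 1 where a value is missing, 0 otherwise (one flag per cell).
completeness_flags = df.isna().astype(int)

# Uniqueness: 1 for every row whose identifier occurs more than once.
uniqueness_flags = df["patient_id"].duplicated(keep=False).astype(int)

# Consistency: 1 where a diagnosis is dated before the date of birth.
# Comparisons against a missing date evaluate to False, so they are not flagged here.
consistency_flags = (df["date_of_diagnosis"] < df["date_of_birth"]).astype(int)

print(completeness_flags)
print(uniqueness_flags.tolist())
print(consistency_flags.tolist())
```

Each result keeps the shape of the data it describes, so a flag can always be traced back to the row (and, for completeness, the cell) that triggered it.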


Why Use DQMaRC?#

  1. In-Depth Cell-Level Markup: DQMaRC generates detailed cell-level markup of DQ issues, allowing precise identification of problematic data directly connected to the source.

  2. Multi-Dimensional Analysis: Provides custom metrics to evaluate DQ across six key dimensions as defined by DAMA, ensuring a thorough assessment.

  3. Plug-and-Play Functionality: DQMaRC offers readily available test parameters that allow users to quickly start assessing DQ without extensive setup or configuration.

  4. Easy and Thorough Customisation: Users can easily customise DQ test parameters either programmatically or in Excel, tailoring the tool to specific needs.

  5. Accessibility: Designed for both Python programmers and non-technical users, with a graphical user interface (GUI) built with Shiny for Python, lowering the barrier to entry.

  6. Flexible and Interoperable: The DQMaRC Python API can be integrated into a diverse range of applications, IT infrastructures, data processing and analysis workflows.

How does DQMaRC Work: The Role of Metadata#

Metadata is data about data. It provides context, meaning, and guidelines on how data should be defined, captured, structured, represented, and analysed, so it should be comprehensive and up to date. Robust metadata is the cornerstone of any DQ evaluation process because it forms the foundation of the business rules and constraints applied during DQ assessment. DQMaRC uses metadata in the form of test parameters to perform DQ tests. The results of these tests generate a binary markup of 1s and 0s, i.e. flags indicating the presence or absence of an error for a given dimension of DQ.

Overall, the DQMaRC workflow can be described in three key steps, as illustrated in Figure 1:

  1. Identify and prepare the source dataset.

  2. Identify and define the relevant metadata, test parameters, and data standards.

  3. Analyse the DQ results.


Figure 1: image showing DQMaRC workflow.#
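
The three workflow steps can be sketched in plain pandas as follows. This is an illustration of the idea of parameter-driven tests producing a binary markup, not DQMaRC's implementation: the `test_params` structure, the `validity_markup` helper, and the example columns are all hypothetical.

```python
import pandas as pd

# Step 1: the source dataset (hypothetical values).
source = pd.DataFrame(
    {
        "age": [34, -2, 57, None],
        "sex": ["F", "M", "X", "F"],
    }
)

# Step 2: metadata expressed as test parameters.
# Hypothetical structure, not DQMaRC's own parameter format.
test_params = {
    "age": {"min": 0, "max": 120},
    "sex": {"allowed": {"F", "M"}},
}


def validity_markup(data, params):
    """Return a DataFrame of 1/0 flags; 1 marks a value that breaks its rule."""
    markup = pd.DataFrame(0, index=data.index, columns=data.columns)
    for column, rule in params.items():
        values = data[column]
        # Missing values are a completeness issue, so they are not flagged here.
        present = values.notna()
        if "min" in rule or "max" in rule:
            lower = rule.get("min", float("-inf"))
            upper = rule.get("max", float("inf"))
            markup[column] = (present & ~values.between(lower, upper)).astype(int)
        if "allowed" in rule:
            markup[column] = (present & ~values.isin(rule["allowed"])).astype(int)
    return markup


markup = validity_markup(source, test_params)

# Step 3: analyse the results, here by counting flagged cells per column.
error_counts = markup.sum()

print(markup)
print(error_counts)
```

Because the markup has the same index and columns as the source, every flag points back to one specific cell, which is the property the cell-level markup described above relies on.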

How To Access DQMaRC?#

You can access DQMaRC either by installing it with pip or through the user-friendly graphical user interface (GUI).

Python Installation#

If you want to run DQMaRC as a Python user, please follow the tutorial provided in the Backend Python Tutorial section.

To install DQMaRC into your Python environment, please follow the instructions below.

To view the package dependencies, you can access the requirements.txt and/or environment.yml file from the DQMaRC GitHub Repository.

The key dependencies are:
  • Python (>= 3.9)

  • NumPy (>= 1.16)

  • pandas (>= 2.2)

  • plotly (>= 5.22)

  • shiny (>= 0.10)

  • ipydatagrid (> 1.3)

  • ipywidgets (> 8.1)
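
If you want to see which of these dependencies are already present in an environment, a small script such as the one below can report them. It is a sketch that uses only the standard-library `importlib.metadata` module; the distribution names are assumed to match the names listed above in lower case, and the script reports versions without comparing them against the minimums.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names assumed to match the dependency list (lower case).
DEPENDENCIES = ["numpy", "pandas", "plotly", "shiny", "ipydatagrid", "ipywidgets"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(DEPENDENCIES)
for name, installed in report.items():
    print(f"{name}: {installed or 'not installed'}")
```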

Installation

  1. Install Python >= 3.9:

Make sure you have Python >= 3.9 installed. You can download it from the Python website.

  2. Install `virtualenv` (optional):

The `venv` module used in step 4 ships with Python 3, so virtualenv is only needed if you prefer it as an alternative. Install it if you don’t have it already:

    pip install virtualenv

  3. Navigate to the Appropriate Directory:

Open a terminal and navigate to your project directory.

  4. Create a Python Virtual Environment (requires Python >= 3.9):

    python -m venv <environment_name>

Replace <environment_name> with your desired environment name.

  5. Activate the Virtual Environment:

    Windows:

    <environment_name>\Scripts\activate

    MacOS/Linux:

    source <environment_name>/bin/activate

  6. Download and Install DQMaRC:

Using the distribution wheel file:

    pip install DQMaRC-1.0.0-py3-none-any.whl

Using PyPI:

    pip install DQMaRC

Or clone directly from GitHub:

    git clone https://github.com/The-Christie-NHS-FT/DQMaRC

  7. Verify the Installation:

You can verify that the environment is active and working by checking the installed packages:

    pip list
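
As a complement to `pip list`, the same checks can be made from within Python. The sketch below uses only the standard library; it assumes that the installed distribution is named `DQMaRC`, as in the `pip install DQMaRC` command, and that the environment was created with `venv` or a recent `virtualenv`.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def in_virtual_environment():
    """Return True when running inside a venv-style virtual environment."""
    return sys.prefix != sys.base_prefix


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


python_ok = sys.version_info >= (3, 9)
# Distribution name assumed from the `pip install DQMaRC` command.
dqmarc_version = installed_version("DQMaRC")

print("Python >= 3.9:", python_ok)
print("Virtual environment active:", in_virtual_environment())
print("DQMaRC version:", dqmarc_version or "not installed")
```

Run it with the interpreter of the activated environment; if the last line prints "not installed", the package was most likely installed into a different environment.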

The Shiny App User Interface#

If you prefer a graphical interface, please refer to the Frontend ShinyPy Tutorial section, which walks you through the installation and use of the Shiny for Python interface to run DQMaRC without writing any code. We built this frontend graphical user interface with Shiny for Python to make DQMaRC accessible to users who do not program in Python.

You can access a serverless, web-hosted version here: DQMaRC Shiny Front End. Please note that this version runs in your local web browser. For more information, refer to Shinylive web hosting.

If you installed DQMaRC using pip, you can also run the Shiny app locally from a terminal, bash, or Anaconda PowerShell, as described in the Frontend ShinyPy Tutorial section.


Here is an overview of the front-end user interface, which is explained in more detail in the Frontend ShinyPy Tutorial section.


Figure 2: Example data processed in the DQMaRC frontend graphical user interface built in shiny for python.#

Cite DQMaRC#

Please use the following citation if you use DQMaRC:

Lighterness, A., Adcock, M.A., and Price, G. (2024). DQMaRC: A Python Tool for Structured Data Quality Profiling (Version 1.0.0) [Software]. Available from christie-nhs-data-science/DQMaRC.

References#

[1] Government Data Quality Hub. (2021, June 24). Meet the data quality dimensions. GOV.UK. https://www.gov.uk/government/news/meet-the-data-quality-dimensions